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Abstract 

Global temperature variations between 1861 and 1984 are forecast using regularization network, mul¬ 
tilayer perceptrons, linear autoregression, and a local model known as the simplex projection method. 
The simplex projection method is applied to characterize complexities in the time series in terms of the 
dependence of prediction accuracy on embedding dimension and on prediction-time interval. Nonlinear 
forecasts from the library patterns between 1861 and 1909 reveal that prediction accuracies are optimal 
at the embedding dimension of 4 and deteriorate with prediction-time interval. Regularization network, 
backpropagation, and linear autoregression are applied to make short term predictions of the meteorolog¬ 
ical time series from 1910 to 1984. The regularization network, optimized by stochastic gradient descent 
associated with colored noise, gives the best forecasts. For all the models, prediction errors noticeably 
increase after 1965. These results are consistent with the hypothesis that the climate dynamics is charac¬ 
terized by low-dimensional chaos and that the it may have changed at some point after 1965, which is also 
consistent with the recent idea of climate change. However, care must be taken of such an interpretation 
in that a time series of colored noise with few data points that has zero mean and many degrees of freedom 
can also show a similar behavior. 
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Introduction 




intrinsically stochastic. However, since the observation 
are correlated in time, it might be still predictable to 
some extent. 

It is usually very difficult to discover, from a time se¬ 
ries, what kind of mechanism generated the data, but 
building models of the type (1) is often useful to under¬ 
stand the relevant variables of the system and its de¬ 
gree of randomness. Until not many years ago most of 
the models used to analyze time series were linear, and 
therefore limited to succeed on the class of linear prob¬ 
lems. However, recent progress in theoretical and com¬ 
putational aspects in nonlinear problems has enabled us 
to characterize the degrees of randomness in nonlinear 
systems and to forecast nonlinear dynamical behavior. A 
number of authors, among which Farmer and Sidorowich 
(1987), Casdagli (1989), Sugihara and May (1990), have 
applied and developed novel nonlinear regression tech¬ 
niques to predict chaotic time series, and shown that 
short-term predictability can often be achieved with a 
good degree of accuracy. They have also demonstrated 
that low-dimensional chaos can be distinguished from 
white noise according to nonlinear forecasts of the dy¬ 
namical behavior. 

One could find the minimal embedding dimension and 
valid prediction-time interval through applying a vari¬ 
ety of network structures by trial and error of cross- 
validation techniques. Such procedure, however, is likely 
to be computationally expensive, and one of the ways to 
circumvent such problem is to resort to local approxima¬ 
tion such as the algorithm of Sugihara and May (1990). 
In local approximation, predictions are made from only 
nearby states in the past, so that it consumes less com¬ 
putational time. The local approximation, however, has 
shortcomings that the mapping is discontinuous, which 
results in less sufficient prediction accuracy than global 
approximation such as neural networks. Thus the asso¬ 
ciation of the local approximation with neural networks 
is an effective way for building a model to make predic¬ 
tions of complex time series. One could find the min¬ 
imal embedding dimension and the valid prediction in¬ 
terval using the local approximation without consuming 
a lot of computational time and forecast the time series 
with good prediction accuracy using neural networks the 
structures of which are determined according to the fore¬ 
casts by the local approximation. 

2.1 Global temperature time series and the 
climate attractor 

It has been controversial whether the climate is low¬ 
dimensional chaos or not (Nicolis and Nicolis, 1984; 
Grassberger, 1984; Essex, Lookman and Nerenberg, 
1987; Lorenz, 1991). Nicolis and Nicolis (1984) first 
claimed the existence of a low-dimensional climate at¬ 
tractor in terms of the correlation dimension of the 
time series synthesized by interpolations from an isotope 
record of deep-sea cores. Grassberger (1984), however, 
argued that their estimate may reflect not the actual 
climatic dynamics but the artifact due to the interpola¬ 
tion. Meanwhile, Essex et al. (1987) published a calcu¬ 
lation on the correlation dimension for non-Hltered time 


series of daily geopotential observations. They agreed 
with the existence of such climate attractor. Recently 
Lorenz (1991) expressed doubts on the interpretations 
of the previous calculations. He claimed hat a low¬ 
dimensional attractor is unlikely to exist for the global 
climate, although the climatic subsystems could be low¬ 
dimensional chaos. 

3 Is there a global climatic attractor? 

In this section we first discuss the plausibility of the 
hypothesis that there is low-dimensional chaotical sys¬ 
tem underlying the global temperature time series, and 
then present some results about its short-term predic¬ 
tions with a number of methods. We use a data set 
published by Jones et al. (1986), who synthesized global 
mean surface air temperature differences between suc¬ 
cessive years by correcting non-climatic factors in near¬ 
surface temperature data over the land and the oceans of 
both hemispheres. Although some questions such as the 
influence of urbanization effect are raised about the sig¬ 
nificance of the data (Wu, Newell and Hsuing, 1990) we 
still think it is of interest to see what kind of information 
can be extracted from this data set. 

3.1 Testing for chaos 

We have used the technique proposed by Sugihara and 
May (1990) to understand the nature of the time se¬ 
ries we decided to analyze. The technique consists in 
studying the behaviour of the correlation coefficient be¬ 
tween the prediction and the target as a function of how 
many steps in the future we are trying to predict, and 
of the embedding dimension. Sugihara and May argued 
that looking at the rate of decrease to zero of the cor¬ 
relation coefficient it is possible to distinguish chaotic 
systems from non-chaotic ones. In our experiments, fol¬ 
lowing Sugihara and May, predictions have been done 
according to the simplex projection technique described 
in appendix (A.4). 

Figures (2) and (3) show the Sugihara-May technique 
applied to a number of synthetic time series. Figure (2) 
shows a plot of the correlation coefficient as a function of 
embedding dimension for predictions of: a) white noise 
(open triangles and dashed line); b) / _1 6 -noise (solid 
triangles and dotted line); c) Henon mapping superim¬ 
posed on white noise (solid squares and short-dashed 
line), d) sine wave superimposed on white noise (open 
circles and long-dashed line). The white random noise 
has been synthesized using equation (8) in appendix (B) 
with a = 0. In Figure (2), predictions have been made 
on a test set of 150 vectors from a data set of 150 exam¬ 
ples. 

Figure (3) shows a plot of the correlation coefficient 
as a function of prediction-time step for the time series 
shown in figure (2). In figure (3), predictions have been 
done assuming an embedding dimension n = 3, in similar 
way to Figure (2). 

The results in figure (2) and figure (3) reveal charac¬ 
teristics of complexities in each time series. The Henon 
mapping superimposed on white noise has a peak em¬ 
bedding dimension and its prediction accuracy deterio¬ 
rates quickly with prediction-time step. This is typical of 



low-dimensional chaos. The sine wave superimposed on 
white noise has long-term predictability independently of 
embedding dimension. White noise is unpredictable at 
all. It should be noticed that the / _1 6 -noise has short¬ 
term predictability independently of embedding dimen¬ 
sion. According to some recent work (Miyano, 1994; 
Miyano et al. 1992), such characteristics are also found 
for an actual colored noise observed for a semiconductor 
device. The flat relation between the correlation coeffi¬ 
cient and embedding dimension is an important indica¬ 
tor in distinguishing low-dimensional chaos from colored 
noise. 

We applied the same techniques described above to the 
time series observed by Jones et al. (1986). The time 
series has been obtained by hand-scanning the original 
figures on the paper. Therefore, the present time series 
includes some read error. We use the first 45 input vec¬ 
tors, i.e., the data from 1861 to 1915, as training data, 
being motivated by the idea that a climate change may 
have occurred around the mid 20th-century by factors 
such as increasing energy consumption and environmen¬ 
tal pollution. The correlation coefficient as a function of 
the embedding dimension for predictions of the meteo¬ 
rological time series is presented in Figure (4). Forecasts 
up to 1944 and up to 1984 are shown by solid circles 
and solid line, and by open circles and dashed line, re¬ 
spectively. Notice that the plot has a peak at n = 4, 
suggesting that this time series has been generated by 
a dynamical system with a low dimensional attractor of 
dimension 4. 

A plot the correlation coefficient versus the 
prediction-time step is shown in Figure (5), where the 
embedding dimension has been set to 4, according to 
the results of figure (4). Forecasts from 1910 to 1944 
and from 1910 to 1984 are indicated by solid circles and 
solid line, and by open circles and dashed line, respec¬ 
tively. The prediction accuracy deteriorates rapidly with 
increasing time step in both cases. This indicates that 
the time series has only short-term predictability. The 
plots shown in Figure (4) and Figure (5) appear to be 
typical of a low-dimensional chaotic time series. Such di¬ 
agnosis, however, is dangerous, since colored noise with 
many degrees of freedom can also provide a similar trend 
in prediction, when handling relatively small number of 
data points. 

Figure (6) and Figure (7) show a plot of the corre¬ 
lation coefficient versus the embedding dimension and 
prediction-time step respectively for f~ 18 (Miyano, 
1994; Miyano et al. 1992). The fractal dimension 
was estimated using the algorithm developed by Higuchi 
(1988). The power spectrum of the random noise can be 
described as / _1 8 with respect to frequency /, accord¬ 
ing to the relation between the fractal dimension D and 
the power law index a of D = (5 — a)/2. In both 
Figure (6) and Figure (7), solid circles and solid line cor¬ 
respond to training and testing set size of 50, while open 
circles and dashed line correspond to training and test 
size of 500. Notice the sensitivity to the number of data. 


4 Forecasting the global temperature 
time series 

We tested three different approximation techniques to 
make predictions one step ahead, with an embedding 
dimension equal to 4: 

1. Gaussian Hyper Basis Functions with a linear term. 
A network of 5 Gaussian units has been used, to 
which a linear term has been added. The linear 
term has been computed first, and then the residu¬ 
als have been approximated with the gaussian net¬ 
work. Minimization of the mean square error was 
run over the coefficients, the centers and the vari¬ 
ances of the gaussians, for a total of 30 free param¬ 
eters. 

2. Multilayer Perceptron with one hidden layer. A 
standard Multilayer Perceptron network was used 
for the prediction, with 4 hidden sigmoidal units, 
for a total of 24 parameters. 

3. linear regression. A linear regression model was fit¬ 
ted to the data, using the statistical package Splus 
(Becker, Chambers and Wilks, 1988). 

We use the first 45 data points, i.e., the time series 
from 1861 to 1909, as training set, and tested the re¬ 
sulting approximation on three different test sets: 1910- 
1944, 1910-1964, 1910-1984. For each experiment we 
measured the root mean square error s: 



where the sum runs over the elements of the set being 
tested, x a are test values and x a are the values predicted 
by the model. We also measured for each test set the 
variance 

k 

a = , ^(*«— < x >) 2 

\ a= 1 

where < x > is the average value of the test set. Notice 
that the quantity ^ is particularly significant, because 
if it has value zero then predictions are perfect, while if 
it is equal to 1 then predictions are no better that the 
average of the targets. 

The experimental results for £ and a have been reported 
in table (1), while the forecasts are shown in figures (8), 
(9) and (10) respectively. In the upper part of the figures 
we displayed the observed time series, represented by a 
dashed line, and the trained model, represented by the 
solid line. Solid circles have been used for the test set 
and white circles for the training set. In the lower part of 
the figures the residuals of the approximation have been 
shown. 

It is clear that the Hyper Basis Function model makes 
best forecasts and that the linear regression makes the 
worst. This suggest that the dynamical behavior of the 
time series is nonlinear. It should be noticed, however, 
that prediction error increases remarkably after 1965 for 
all the models, although it is more pronounced in the 




Linear 

HBF 

MLP 

e (training) 

0.10 

0.09 

0.90 

e (1910-1944) 

0.14 

0.10 

0.12 

e (1910-1964) 

0.14 

0.10 

0.13 

e (1910-1984) 

0.16 

0.12 

0.15 

e/a (training) 

0.84 

0.83 

0.81 

e/a (1910-1944) 

1.13 

0.82 

1.01 

e/a (1910-1964) 

1.25 

0.88 

1.11 

e/a (1910-1984) 

1.31 

0.98 

1.18 


Table 1: Root-mean-squared prediction error (e) and 
normalized Root-mean-squared prediction error (e/<r) 
for the 3 technique we tested. See text for explanation 


Multilayer Perceptron and linear regression. This is not 
inconsistent with the idea of global warming suggested 
by Jones et al. In fact, assume that a model is success¬ 
ful in learning the dynamics underlying the meteorologi¬ 
cal time series that correspond to the period 1861-1909. 
Then, one possible interpretation for the increase in the 
prediction error after 1965 is that a change in the climate 
dynamics took place at some point after 1965. 

Such straightforward interpretation could be, how¬ 
ever, a misdiagnosis, since the trend in prediction error 
could be due to failure in generalizing the underlying dy¬ 
namics. In fact, according to our recent work (Miyano, 
1994), a similar trend in prediction error can also be 
present in forecasting colored noise with relatively few 
data points. Figure (11) shows the / _1 8 -noise observed 
for a semiconductor device (Miyano et al. 1992). In Fig¬ 
ure (12), we present results of learning and predictions 
of the random noise by regularization network with 3 
input nodes and 5 hidden nodes without linear terms. 
Figure (13) shows residuals obtained by subtracting the 
corresponding target from each prediction. We use first 
250 points in the training set. In this case, the opti¬ 
mal embedding dimension is assumed to be 3 accord¬ 
ing to the results shown in Figure (7). The predictions 
agree well with the targets except for the portion from 
time = 300 to 350 during which the network forecasts 
lower values than observed. From Figures (12) and (13), 
one would make a misdiagnose that the time series was 
low-dimensional chaos and that the discrepancy between 
time = 300 and 350 indicated some change in the dy¬ 
namics. The discrepancy is, however, clearly due to fail¬ 
ure in generalization. 

5 Discussion and open problems 

The present considerations on the meteorological time 
series leads to two possible interpretations of the global 
climate: one is that a low-dimensional climate attrac¬ 
tor may exist and that the climate dynamics may have 
altered at some point after 1965; the other is that the 


temperature variations may be colored noise with many 
degrees of freedom. The latter interpretation would lead 
to the following forecast of the future trend in the cli¬ 
mate: the global temperature would begin to decrease at 
some point in the future, since colored noise has a zero 
mean in a long time period. Within the framework of 
the present study we can dismiss neither of the interpre¬ 
tations. The present work is still in a preliminary stage. 
In a future paper, we plan to forecast non-hand-scanned 
meteorological time series with more data points and to 
clarify whether the climate is low-dimensional chaos or 
not. 
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A Regression techniques 

In this section we describe the different regression tech¬ 
niques that we have used to forecast and analyze the 
global temperature time series. The first three are based 
on the assumption that the data can be well approxi¬ 
mated by a parametric function of the form 

n 

/(x) = y^c a H(x;w a ) 

a = 1 

where the c a and the w„ are considered free parameters 
and are found by a least squares technique. Models of 
this type are usually called “neural networks”, because 
they can be represented by a network with one layer 
of hidden units. The last technique is a local technique, 
similar in spirit to nearest neighbor models and to kernel 
regression. 

A.l Linear model 

The simplest model of the form (1) that we can build is 
one in which the function / is linear, and therefore the 
relation between the past values of x and the future ones 
is of the type: 

x(t) = a 0 +aix(t-l)+a 2 x(t-2)-\ - \-a n x(t—n)+£ t (3) 

This is the so called AR (autoregressive) model, and is 
one of the many linear models that have been developed 
in time series analysis. There is clearly a huge litera¬ 
ture about linear models (see for example Myers, 1986; 
Draper and Smith, 1981; Searle, 1971) and we will not 
spend more time on this topic in this paper. In our ex¬ 
periments the coefficients of the model have been com¬ 
puted according to standard least square routines, that 
correspond to assuming that the random variables in 
eq. (3) are independent and have Gaussian distribution 
with zero mean. The statistical package Splus (Becker, 
Chambers and Wilks, 1988) has been used to perform 
this calculation. 



A.2 Gaussian Hyper Basis Functions with a 

linear term 

One of the parametrizations we used for the prediction 
task had the following form: 

n 

/(x) = ^c«e-»“H x - t “ll 2 +a.x + d. (4) 

a = 1 

The coefficients c a , a, d, the centers t a , the widths w a of 
the gaussians were considered as free parameters. Tech¬ 
niques of this type are quite common in the neural net¬ 
work literature (Moody and Darken, 1989; Poggio and 
Girosi, 1990), although the linear term has not beed 
used very often. When the widths of gaussians have 
all the same value the technique is a particular case of 
is a particular case of a general class of approximation 
techniques, called regularization networks, that share the 
property that they can be justified in the framework of 
regularization theory (Girosi, Jones and Poggio, 1993) 
and can be represented by a network with one layer of 
hidden units. 

Usually the parameters of the network are found by 
minimizing the mean square error by some numerical 
minimization technique. In our case, since the number 
of data points available is very small, we do not expect to 
be able to model surfaces much more complicated than 
an hyperplane. Therefore, we first fit an hyperplane to 
the data (the term a • x + d), and then estimate the 
parameters of the rest of the expansion by fitting the 
residuals of the hyperplane approximation, choosing a 
small number n of basis functions. The gaussian basis 
functions are therefore to be considered as a “correc¬ 
tive term” to a linear approximation technique. The 
minimization technique we used for estimating the pa¬ 
rameters of the gaussian basis functions is a stochastic 
gradient descent algorithm, described in appendix B. 

A.3 Multilayer perceptrons 

Multilayer perceptrons (MLP) are an approximation 
technique that is based, in its most common form, on 
the following parametric representation: 

n 

/(x) = ^2 c 8 '(t(x • W; + 6 {) , 

8 = 1 

where a is a sigmoidal function, that in our case we 
set to , , * , that is one the most common choices. In 

our computations the parameters of the network have 
been estimated using backpropagation and the general¬ 
ized delta rule (Rumelhart, Hinton and Williams, 1986; 
Hecht-Nielsen, 1989), that is a form of stochastic gradi¬ 
ent descent. 


sampling an unknown function / in presence of noise. 
When the value of / at a point x that does not belong 
to the data set has to be computed, first its closest d + 1 
points x 8l , . . . x; d+1 in the data set are found. Then the 
value of the function at x is estimated by a weighted 
average of the values of the function at the data points 
x 8l , . . . x; d+1 , that is the values j/ 8l , . . . yi d+1 . Points that 
are more far from x receive a smaller weight, according to 
an exponential decay. In formulas, the estimated value 
of / at x is 


/(x) 



X ^ d + 1 p-Crda, 
Z=a=l e 


where we have defined 


d a = ||x - XiJ| 

and a is a parameter that define the locality of the tech¬ 
nique, and can be set with cross-validation techniques. 
This technique can be considered as an approximation 
of kernel regression with e~ x as a kernel. In fact in this 
case kernel regression would have the form: 


/(x) 


Ef = 1 ^e-Hx-x.|| 

y N , e- CT l|x-xdl 

/ J% — 1 


(5) 


The method we use in this paper is equivalent to take 
only the closest d + 1 points in the expansion (5). 


B Stochastic gradient descent 

Let H(a) be a function that has to be minimized with 
respect to the set of parameters £. The standard gradi¬ 
ent descent algorithm is an iterative algorithm in which 
the set of parameters £ evolves toward the minimum ac¬ 
cording to the following law: 

* = ( 6 ) 

dt d( { J 

where t is a time parameter in the iteration loop of the 
optimization process and u> > 0 is called learning rate. 
One of the many inconveniences of this algorithm is that, 
if converges, it converges to a local minimum. Since in 
the optimization problems arising from the training of a 
neural network it is known that multiple local minima 
are present, it is important to have a technique that is 
able to escape at least some of the local minima and get 
close to the global minimum. A simple way to achieve 
this consists in adding a stochastic term in eq. (6), that 
becomes: 


A.4 Simplex projection method 

The simplex projection method, that has been used by 
Sugihara and May (1990) to tell chaotic time series from 
non chaotic ones, is a local approximation method, very 
close to the fc-nearest neighbour technique and kernel 
regression. Suppose that a data set of N data points in 
d dimensions has been given, consisting of input-output 
pairs {(xj, yi)}fL 1 that have been obtained by randomly 


d£ 

dt 


dH(Q 

OH 


T](t) 


(7) 


where rj(t) is random noise. As an effect of the addition 
of the noise term rj the set of parameters £ will not always 
follow the direction of the gradient, and will sometime 
go uphill instead of downhill, having a chance to escape 
some local minima. In our case the random noise rj(t) 
was synthesized by the following equation: 



X 

T](t) = £ (2irck) 2 [Acos(2irckt) + Bsin(2irckt)\ (8) 
k =1 

where c and X are set to 0.01 and 20000, respectively, 
and A and B random numbers lying between 0 and 1. 
The power spectrum of rj(t) is given as with respect 
to the frequency /. In order to verify the validity of 
the algorithm, we adapt it to optimizing the regulariza¬ 
tion network to predict a chaotic time series synthesized 
by Henon mapping (embedding dimension is set to 2. 
White noise (i.e., a = 0), / _1 -noise (pink noise), and 
/ _2 -noise are used as perturbation. It is found that / -1 - 
noise works as well as white noise, while / _2 -noise does 
not. The value of the normalized root mean square er¬ 
ror e/<7 for a test set of 1000 points and a training set of 
1000 examples was 0.097 for a network with 5 basis func¬ 
tions and no linear term, in the case in which / _1 -noise 
on 1000 examples. This good result confirmed that the 
stochastic gradient descent algorithm worked correctly 
and achieved some good local minimum. 
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Embedding dimension 

Figure 2: Plot of the correlation coefficient as a function 
of embedding dimension for various synthetic time series: 
white noise (open triangles and dashed line); / -1 b -noise 
(solid triangles and dotted line); Henon mapping super¬ 
imposed on white noise (solid squares and short-dashed 
line); and sine wave superimposed on white noise (open 
circles and long-dashed line). Predictions on a test set 
of 150 points were made from a training set of 150 ex¬ 
amples. 



Embedding dimension 

I i" ii re 4: Plot of the correlation coefficient as a function 
of embedding dimension for the meteorological time se¬ 
ries. Predictions are made from the first 45 data points, 
i.e., the data from 1861 to 1915. Forecasts up to 1944 
and up to 1984 are shown by solid circles and solid line, 
and by open circles and dashed line, respectively. 




Time step Time step 


Figure 3: Plot of the correlation coefficient as a function 
of prediction-time steps for various synthetic time series; 
white noise (open triangles and dashed line); / -1 b -noise 
(solid triangles and dotted line); Henon mapping super¬ 
imposed on white noise (solid squares and short-dashed 
line); and sine wave superimposed on white noise (open 
circles and long-dashed line). Predictions on a test set 
of 150 points were made from a training set of 150 ex¬ 
amples. The embedding dimension is set to 3. 


Figure 5: Plot of the correlation coefficient as a function 
of prediction-time step for the meteorological time series. 
Forecasts are made from the first 45 data points, i.e., the 
data from 1861 to 1909. Forecasts from 1910 to 1944 and 
from 1910 to 1984 are shown by solid circles and solid 
line, and by open circles and dashed line, respectively. 
The embedding dimension has been set to 4, according 
to the results of fig. (4). 
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Figure 9: Forecasts of the meteorological time series 
by backpropagation network: forecasts(upper part); and 
residuals obtained by subtracting the- corresponding tar¬ 
get from each prediction (lower part). The network has 
been trained on the first. 45 data, points, i.e., the time 
series from 1861 to 1909. Solid circles and solid line in¬ 
dicate predictions on the input, vectors that the network 
has not. seen. Open circles and solid line indicate results 
of training. The observed time series is shown by dashed 
lines. 
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Figure 10: Forecasts of the meteorological time series by 
linear autoregression: forecasts(upper part); and residu¬ 
als (lower part). The model has been trained on the first. 
45 data, points, i.e., the time series from 1861 to 1909. 
Solid circles and solid line indicate predictions on the in¬ 
put. vectors that the network has not. seen. Open circles 
and solid line indicate results of fitting. The observed 
time series is shown by dashed lines. 
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